feat(ci): add 6.1 -> 6.18 cross snapshot testing#5856

Merged
JackThomson2 merged 11 commits into firecracker-microvm:main from JackThomson2:feat/6-18-cross
Apr 24, 2026

@JackThomson2
Contributor

Changes

The cross-snapshot pipeline had been heavily neglected (and was in fact skipping all tests), so I first spent some time fixing it up to ensure the tests run properly. Also expanded coverage to newly onboarded instances that had not yet been added to the pipeline.

Hardened our testing of snapshot restore: checking that the clock works as expected in the guest, adding stronger networking checks, verifying the disk more thoroughly, etc.

Added the new 6.18 target, which we will use to test whether a snapshot created on 6.1 can be restored on 6.18.

Also took the opportunity to fix the ordering of the pipeline so that we no longer wait for all instances to complete before running the restore tests.

Link to run on my pipeline: https://buildkite.com/firecracker/jack-a-b/builds/97/steps/canvas
Link to an example negative test showing that the tests fail as expected with incompatible kernels: https://buildkite.com/firecracker/jack-a-b/builds/94/steps/canvas?sid=019dbb56-6a1c-49d9-ab8f-cc74a38f3301&tab=output

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

Two bugs were preventing cross-kernel restore tests from running:

  1. The glob pattern only searched one level deep under
     snapshot_artifacts/, but Phase 1 artifacts are nested under an
     additional test-name directory. Use recursive glob (**/) to find
     snapshot directories regardless of nesting depth.

  2. The "None" CPU template was only added to the search list on
     x86_64, so on aarch64 instances where get_supported_cpu_templates()
     returns an empty list (e.g. Neoverse N1), the loop yielded zero
     pytest parameters and the test was silently skipped. Always
     include "None" in the search list.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
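
The two fixes described in this commit can be illustrated with a minimal sketch. It is not the code from the PR: the `vmstate` marker file, the directory names, and both helper functions are hypothetical and chosen only to show why a one-level glob misses nested artifacts and why an empty template list yields zero test parameters.

```python
import tempfile
from pathlib import Path


def find_snapshot_dirs(root: Path, marker: str = "vmstate") -> list[Path]:
    """Return every directory under root that directly contains the marker file."""
    # "**/" makes the search recursive, so an extra test-name directory
    # between root and the snapshot directory no longer hides it.
    return sorted(p.parent for p in root.glob(f"**/{marker}"))


def cpu_templates_to_search(supported: list[str]) -> list[str]:
    """Always include "None" so an empty supported list still yields one case."""
    return ["None"] + [t for t in supported if t != "None"]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: snapshot_artifacts/<test name>/<snapshot dir>/vmstate
    root = Path(tmp) / "snapshot_artifacts"
    nested = root / "some_test_name" / "snap_a"
    nested.mkdir(parents=True)
    (nested / "vmstate").touch()

    # A one-level pattern only looks at snapshot_artifacts/*/vmstate.
    shallow = [p.parent for p in root.glob("*/vmstate")]
    deep = find_snapshot_dirs(root)

# Mirrors the aarch64 case where no CPU templates are reported as supported.
templates_when_none_supported = cpu_templates_to_search([])
```

With the nested layout above, the one-level pattern finds nothing while the recursive pattern finds `snap_a`, and the template list is never empty.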
Add AL2023/linux_6.18 as a restore-only platform in the cross-snapshot
pipeline for both x86_64 and aarch64. Snapshots created on 6.1 hosts are
restored on 6.18 hosts to validate cross-kernel compatibility. The 6.18
platform is scoped to pipeline_cross.py only since 6.18 agents exist
exclusively in the private Buildkite queue.

Signed-off-by: Jack Thomson <jackabt@amazon.com>

guest_run_fio_iteration ran fio in the background and only checked that
the process launched, not that IO actually succeeded. Run fio in the
foreground with JSON output and assert that bytes were read from the
block device. This addresses the TODO about verifying the root device
is not corrupted after snapshot restore.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
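
As a rough sketch of the stronger check described in this commit, the snippet below parses a fio report and asserts that bytes were actually read. It assumes fio was run with `--output-format=json` and that the report exposes `jobs[].error` and `jobs[].read.io_bytes`; the function name and the sample report are illustrative and are not taken from the PR.

```python
import json


def assert_fio_read_bytes(fio_json: str) -> int:
    """Parse a fio JSON report and return total bytes read, failing if zero."""
    report = json.loads(fio_json)
    total = 0
    for job in report["jobs"]:
        # A non-zero "error" field means fio itself reported a failure.
        assert job.get("error", 0) == 0, f"fio job {job.get('jobname')} failed"
        total += job["read"]["io_bytes"]
    assert total > 0, "fio completed but read no bytes from the device"
    return total


# Hand-written sample report, not real fio output.
sample = json.dumps(
    {"jobs": [{"jobname": "readtest", "error": 0, "read": {"io_bytes": 4096}}]}
)
bytes_read = assert_fio_read_bytes(sample)
```

Running fio in the foreground is what makes this possible: the report only exists once the job has finished, so a launch-only check cannot observe it.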
Check that /dev/hwrng is functional after restoring a snapshot on a
different host kernel version.

Signed-off-by: Jack Thomson <jackabt@amazon.com>

@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.87%. Comparing base (458483a) to head (2ca96c3).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5856   +/-   ##
=======================================
  Coverage   82.87%   82.87%           
=======================================
  Files         276      276           
  Lines       29728    29728           
=======================================
  Hits        24637    24637           
  Misses       5091     5091           
Flag Coverage Δ
5.10-m5n.metal 83.17% <ø> (ø)
5.10-m6a.metal 82.51% <ø> (+<0.01%) ⬆️
5.10-m6g.metal 79.78% <ø> (ø)
5.10-m6i.metal 83.17% <ø> (-0.01%) ⬇️
5.10-m7a.metal-48xl 82.49% <ø> (ø)
5.10-m7g.metal 79.78% <ø> (ø)
5.10-m7i.metal-24xl 83.14% <ø> (ø)
5.10-m7i.metal-48xl 83.15% <ø> (+<0.01%) ⬆️
5.10-m8g.metal-24xl 79.78% <ø> (+<0.01%) ⬆️
5.10-m8g.metal-48xl 79.78% <ø> (+<0.01%) ⬆️
5.10-m8i.metal-48xl 83.15% <ø> (+<0.01%) ⬆️
5.10-m8i.metal-96xl 83.15% <ø> (ø)
6.1-m5n.metal 83.20% <ø> (+<0.01%) ⬆️
6.1-m6a.metal 82.53% <ø> (-0.01%) ⬇️
6.1-m6g.metal 79.78% <ø> (ø)
6.1-m6i.metal 83.19% <ø> (ø)
6.1-m7a.metal-48xl 82.52% <ø> (+<0.01%) ⬆️
6.1-m7g.metal 79.78% <ø> (-0.01%) ⬇️
6.1-m7i.metal-24xl 83.21% <ø> (ø)
6.1-m7i.metal-48xl 83.21% <ø> (+<0.01%) ⬆️
6.1-m8g.metal-24xl 79.78% <ø> (ø)
6.1-m8g.metal-48xl 79.78% <ø> (ø)
6.1-m8i.metal-48xl 83.22% <ø> (+<0.01%) ⬆️
6.1-m8i.metal-96xl 83.21% <ø> (ø)

@JackThomson2 JackThomson2 added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Apr 23, 2026
JamesC1305 previously approved these changes Apr 24, 2026
Comment thread tests/integration_tests/functional/test_snapshot_restore_cross_kernel.py Outdated
Comment thread .buildkite/pipeline_cross.py Outdated
@JackThomson2 JackThomson2 force-pushed the feat/6-18-cross branch 2 times, most recently from b31460e to 826dbf5 Compare April 24, 2026 15:08
Record guest CLOCK_MONOTONIC in phase1 just before snapshotting, then
read it back after cross-kernel restore and assert the delta is small.
Firecracker is supposed to resume MONOTONIC from capture time (see
a1fd537 "fix(kvm-clock): do not jump monotonic clock on restore"),
so the delta should be near zero regardless of how long phase1 and
restore are apart in the pipeline. A large delta indicates MONOTONIC
jumped forward - a kvm-clock regression that could surface only on
some host-kernel combinations.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
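
The delta check described in this commit could look roughly like the sketch below. The function name and the 5-second threshold are assumptions made for illustration; the PR's actual threshold and the way the guest clock is read are not shown in this conversation.

```python
def assert_monotonic_did_not_jump(
    before_s: float, after_s: float, max_delta_s: float = 5.0
) -> float:
    """Compare guest CLOCK_MONOTONIC readings taken before snapshot and after restore.

    before_s: value recorded in the guest just before the snapshot was taken.
    after_s:  value read in the guest after the cross-kernel restore.
    max_delta_s: hypothetical tolerance for time spent between the two reads.
    """
    delta = after_s - before_s
    # A negative delta would mean the clock moved backwards.
    assert delta >= 0, f"CLOCK_MONOTONIC went backwards by {-delta:.3f}s"
    # A large delta means wall-clock time between phases leaked into MONOTONIC.
    assert delta <= max_delta_s, f"CLOCK_MONOTONIC jumped forward by {delta:.3f}s"
    return delta


delta = assert_monotonic_did_not_jump(before_s=120.0, after_s=121.5)
```

Because the check compares two guest-side readings, it is independent of how long the pipeline takes between the create and restore phases.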
Add check_network_data_integrity helper that generates random bytes on
the host, pushes them to the guest via SSH command-line (base64-encoded
to survive argv), has the guest decode and sha256 them, and asserts the
guest-side hash matches the host-side hash. This exercises the full
virtio-net RX path end-to-end beyond simple connectivity checks.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
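
A minimal sketch of the host side of such a check follows. It is not the `check_network_data_integrity` helper itself: the function names are hypothetical, the guest is assumed to provide `base64 -d` and `sha256sum`, and the guest is simulated locally instead of being reached over SSH.

```python
import base64
import hashlib
import os
import shlex


def make_payload(num_bytes: int = 1024) -> tuple[str, str]:
    """Return (base64 text to send, expected sha256 hex digest of the raw bytes)."""
    raw = os.urandom(num_bytes)
    return base64.b64encode(raw).decode("ascii"), hashlib.sha256(raw).hexdigest()


def guest_command(encoded: str) -> str:
    """Shell command for the guest: decode the argument and hash the result."""
    # Base64 keeps the random bytes printable so they survive the command line.
    return f"echo {shlex.quote(encoded)} | base64 -d | sha256sum"


def digest_from_sha256sum(stdout: str) -> str:
    """sha256sum prints '<hex digest>  -' for stdin; keep only the digest."""
    return stdout.split()[0]


encoded, expected = make_payload()
cmd = guest_command(encoded)

# Stand-in for the guest: decode and hash locally to show what must match.
simulated_stdout = hashlib.sha256(base64.b64decode(encoded)).hexdigest() + "  -\n"
guest_digest = digest_from_sha256sum(simulated_stdout)
```

In a real test, `cmd` would be executed in the guest over SSH and its output passed to `digest_from_sha256sum` before comparing against `expected`.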
MemoryMonitor's is_guest_mem heuristic matches a single guest-sized VMA,
but _test_balloon inflates the balloon after restore, and
GuestRegionMmapExt::discard_range overlays MAP_FIXED anonymous mmaps on
the reclaimed ranges (a workaround specific to private file-backed
mappings from snapshot restore). This fragments the 512 MiB guest VMA
into ~190 smaller ones, none of which match the heuristic, and their RSS
(~336 MiB) is counted as VMM overhead.

This is the only cross-kernel test that inflates the balloon post-
restore, and its purpose is validating cross-kernel compatibility, not
VMM memory overhead, so the monitor is skipped here as it already is in
test_snapshot_phase1.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
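
The accounting problem described in this commit can be shown with a deliberately simplified model. The size-equality check below is only a stand-in for the real `is_guest_mem` heuristic, and the VMA list is made up: four equal fragments are used instead of the ~190 observed, with the RSS split evenly.

```python
MIB = 1024 * 1024


def vmm_overhead(vmas: list[tuple[int, int]], guest_mem_bytes: int) -> int:
    """Sum the RSS of every VMA that is not recognised as guest memory.

    Each VMA is a (size_bytes, rss_bytes) pair. A VMA counts as guest
    memory only if its size equals the configured guest memory size.
    """
    return sum(rss for size, rss in vmas if size != guest_mem_bytes)


guest = 512 * MIB

# Before balloon inflation: one guest-sized VMA plus a small VMM mapping.
intact = [(guest, 336 * MIB), (4 * MIB, 2 * MIB)]

# After inflation: the guest VMA is split, so no fragment matches the size check.
fragmented = [(128 * MIB, 84 * MIB)] * 4 + [(4 * MIB, 2 * MIB)]

overhead_intact = vmm_overhead(intact, guest) // MIB
overhead_fragmented = vmm_overhead(fragmented, guest) // MIB
```

In this model the reported overhead grows from 2 MiB to 338 MiB even though the process uses the same amount of memory, which is the kind of false positive that skipping the monitor avoids.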
The perms_aarch64 loop expects aarch64 phase1 snapshots to exist for
restore steps to consume, but the snapshot-create group was x86-only,
so every aarch64 restore step failed at artifact download. Add an
aarch64 snapshot-create group and enable test_snapshot_phase1 on arm.

Signed-off-by: Jack Thomson <jackabt@amazon.com>

Add m8i.metal-48xl (Intel Granite Rapids), m6g.metal (Graviton2) and
m8g.metal-24xl (Graviton4) to the cross-restore pipeline. These pick up
same-instance cross-kernel coverage only; cross-instance restore
permutations are unchanged.

Signed-off-by: Jack Thomson <jackabt@amazon.com>

Previously every restore step waited for the entire snapshot-create
group to finish via a pipeline-wide wait step. Each restore only needs
its own source snapshot, so key each create step by instance/kv and
make each restore step depend (via depends_on) on the specific source
it consumes. Restores
now start as soon as their source snapshot is ready.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
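
The per-source dependency scheme can be sketched as plain dictionaries using Buildkite's `key` and `depends_on` step attributes. The helper names, labels, and placeholder commands are hypothetical and do not come from pipeline_cross.py; the key sanitisation assumes that step keys may only contain alphanumerics, underscores, dashes and colons.

```python
import re


def step_key(prefix: str, instance: str, kv: str) -> str:
    """Build a step key, replacing characters assumed to be invalid in keys."""
    # Dots in names such as "m6g.metal" or "linux_6.1" are replaced with dashes.
    return re.sub(r"[^A-Za-z0-9_:-]", "-", f"{prefix}-{instance}-{kv}")


def create_step(instance: str, kv: str) -> dict:
    """Snapshot-create step, keyed by the instance and kernel it runs on."""
    return {
        "label": f"snapshot-create {instance} {kv}",
        "key": step_key("create", instance, kv),
        "command": f"echo 'create snapshot on {instance} {kv}'",  # placeholder
    }


def restore_step(src_instance: str, src_kv: str, dst_instance: str, dst_kv: str) -> dict:
    """Restore step that waits only for the one snapshot it consumes."""
    return {
        "label": f"restore {src_instance} {src_kv} -> {dst_instance} {dst_kv}",
        "depends_on": [step_key("create", src_instance, src_kv)],
        "command": f"echo 'restore on {dst_instance} {dst_kv}'",  # placeholder
    }


create = create_step("m6g.metal", "linux_6.1")
restore = restore_step("m6g.metal", "linux_6.1", "m6g.metal", "linux_6.18")
```

Compared with a pipeline-wide wait step, each restore here becomes runnable as soon as its single create step finishes, regardless of slower instances elsewhere in the group.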
@JackThomson2 JackThomson2 enabled auto-merge (rebase) April 24, 2026 15:44
@JackThomson2 JackThomson2 merged commit 37d2b74 into firecracker-microvm:main Apr 24, 2026
7 checks passed